I have a decent collection of wines and love them, so I thought it would be a good idea to analyze this dataset.
This dataset contains the physicochemical properties of 1599 red wines of the “Vinho Verde” variety, together with a quality rating on a 0-10 scale.
I think it will be fascinating to analyse how these properties contribute to the overall quality of the wine.
The data was downloaded from https://s3.amazonaws.com/udacity-hosted-downloads/ud651/wineQualityReds.csv
The following text describes the variables and how the data was collected: https://s3.amazonaws.com/udacity-hosted-downloads/ud651/wineQualityInfo.txt
Loading and Preprocessing Data
## 'data.frame': 1599 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
##
## 3 4 5 6 7 8
## 10 53 681 638 199 18
## 'data.frame': 1599 obs. of 13 variables:
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : Ord.factor w/ 6 levels "3"<"4"<"5"<"6"<..: 3 3 3 4 3 3 3 5 5 3 ...
## $ rating : Ord.factor w/ 3 levels "bad"<"average"<..: 2 2 2 2 2 2 2 3 3 2 ...
## fixed.acidity volatile.acidity citric.acid residual.sugar
## Min. : 4.60 Min. :0.1200 Min. :0.000 Min. : 0.900
## 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090 1st Qu.: 1.900
## Median : 7.90 Median :0.5200 Median :0.260 Median : 2.200
## Mean : 8.32 Mean :0.5278 Mean :0.271 Mean : 2.539
## 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420 3rd Qu.: 2.600
## Max. :15.90 Max. :1.5800 Max. :1.000 Max. :15.500
## chlorides free.sulfur.dioxide total.sulfur.dioxide
## Min. :0.01200 Min. : 1.00 Min. : 6.00
## 1st Qu.:0.07000 1st Qu.: 7.00 1st Qu.: 22.00
## Median :0.07900 Median :14.00 Median : 38.00
## Mean :0.08747 Mean :15.87 Mean : 46.47
## 3rd Qu.:0.09000 3rd Qu.:21.00 3rd Qu.: 62.00
## Max. :0.61100 Max. :72.00 Max. :289.00
## density pH sulphates alcohol quality
## Min. :0.9901 Min. :2.740 Min. :0.3300 Min. : 8.40 3: 10
## 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500 1st Qu.: 9.50 4: 53
## Median :0.9968 Median :3.310 Median :0.6200 Median :10.20 5:681
## Mean :0.9967 Mean :3.311 Mean :0.6581 Mean :10.42 6:638
## 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300 3rd Qu.:11.10 7:199
## Max. :1.0037 Max. :4.010 Max. :2.0000 Max. :14.90 8: 18
## rating
## bad : 63
## average:1319
## good : 217
##
##
##
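The preprocessing behind the output above can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the exact code used: the data frame name `rw_sketch` is hypothetical, the quality column is rebuilt from the frequency table printed above instead of being read from the CSV, and the cut points for `rating` (3-4 bad, 5-6 average, 7-8 good) are inferred from the printed counts (63, 1319, 217).

```r
# Quality scores reconstructed from the frequency table printed above
# (assumption: stands in for reading wineQualityReds.csv).
quality_counts <- c("3" = 10, "4" = 53, "5" = 681, "6" = 638, "7" = 199, "8" = 18)

rw_sketch <- data.frame(
  X = seq_len(sum(quality_counts)),  # row index, as in the raw file
  quality = rep(as.integer(names(quality_counts)), times = quality_counts)
)

# Drop the index column: it carries no information for the analysis.
rw_sketch$X <- NULL

# Group the scores into three ordered ratings (cut points inferred from the counts).
rw_sketch$rating <- cut(
  rw_sketch$quality,
  breaks = c(-Inf, 4, 6, Inf),
  labels = c("bad", "average", "good"),
  ordered_result = TRUE
)

# Store quality as an ordered factor, matching the str() output above.
rw_sketch$quality <- factor(rw_sketch$quality, ordered = TRUE)

table(rw_sketch$rating)
```

Building `rating` before converting `quality` to a factor matters here, because `cut()` needs the numeric scores.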
Let’s look at the first few rows of data, just to see what the values look like.
## fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1 7.4 0.70 0.00 1.9 0.076
## 2 7.8 0.88 0.00 2.6 0.098
## 3 7.8 0.76 0.04 2.3 0.092
## 4 11.2 0.28 0.56 1.9 0.075
## 5 7.4 0.70 0.00 1.9 0.076
## 6 7.4 0.66 0.00 1.8 0.075
## free.sulfur.dioxide total.sulfur.dioxide density pH sulphates alcohol
## 1 11 34 0.9978 3.51 0.56 9.4
## 2 25 67 0.9968 3.20 0.68 9.8
## 3 15 54 0.9970 3.26 0.65 9.8
## 4 17 60 0.9980 3.16 0.58 9.8
## 5 11 34 0.9978 3.51 0.56 9.4
## 6 13 40 0.9978 3.51 0.56 9.4
## quality rating
## 1 5 average
## 2 5 average
## 3 5 average
## 4 6 average
## 5 5 average
## 6 5 average
## 3 4 5 6 7 8
## 10 53 681 638 199 18
## [1] 0.8248906
## [1] 3.636023
## [1] 4
First Look - Observations
82.5% of the 1599 observations have a quality of either 5 or 6, with 5 being the most frequent. Quality is a discrete, ordered categorical variable. The mean of 3.64 and median of 4 printed above are computed on the factor's level codes (1 to 6), not on the scores themselves; since the lowest observed score is 3, they correspond to a mean of about 5.64 and a median of 6 on the original scale. Additionally, total sulfur dioxide and free sulfur dioxide appear to take mostly discrete values, likely because of rounding. I would also expect citric acid to be a component of fixed acidity, and potentially related to volatile acidity, because of their similar chemical properties.
To get an idea of how the data is dispersed, I am creating a histogram for each variable. Before analysing relationships between variables, I want to understand the distribution of each variable on its own.
The shape of each distribution (normal, positively skewed or negatively skewed) will also give me some sense of what to expect when I plot different variables against each other. Many variables in this dataset also have extreme outliers. I will remove these extreme outliers for a more robust analysis.
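The gap between factor level codes and actual scores can be checked directly from the frequency table. The sketch below rebuilds the quality column from the printed counts (an assumption standing in for the real data frame) and compares `as.numeric()` on the ordered factor with a conversion that goes through `as.character()` first.

```r
# Rebuild the ordered quality factor from the printed frequency table.
quality <- factor(rep(3:8, times = c(10, 53, 681, 638, 199, 18)), ordered = TRUE)

# Share of wines scored 5 or 6.
share_5_or_6 <- mean(quality %in% c("5", "6"))

# as.numeric() on a factor returns the level codes 1..6, not the scores 3..8.
mean_codes   <- mean(as.numeric(quality))
median_codes <- median(as.numeric(quality))

# Converting via as.character() recovers the original scores.
scores        <- as.numeric(as.character(quality))
mean_scores   <- mean(scores)
median_scores <- median(scores)

round(c(share = share_5_or_6, mean_codes = mean_codes, mean_scores = mean_scores), 4)
```

Because the levels run from 3 to 8 without gaps, the two means differ by exactly 2.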
Univariate - 00- Quality
## 3 4 5 6 7 8
## 10 53 681 638 199 18
## bad average good
## 63 1319 217
One thing I observed from the above two plots is that most of the wines in the dataset are of average quality. So I am wondering whether this dataset is complete or not. Was this data collected from a specific geographical location, or was it spread over a larger area? As the good quality and the poor quality wines are almost like outliers here, it might be difficult to build an accurate model of wine quality.
Let’s start.
Univariate - 01- fixed.acidity
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.60 7.10 7.90 8.32 9.20 15.90
##
## Pearson's product-moment correlation
##
## data: fixed.acidity and as.numeric(rating)
## t = 5.0711, df = 1597, p-value = 4.417e-07
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.07734248 0.17383472
## sample estimates:
## cor
## 0.1258863
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.600 7.100 7.800 8.089 9.000 11.800
The distribution of fixed acidity is positively skewed (mean > median), with most wines concentrated around the median of 7.9. The correlation between fixed.acidity and rating is weak (0.13).
Let’s investigate this variable further by removing outliers. Reference: https://stackoverflow.com/questions/6253837/subset-data-frame-based-on-percentage
After removing outliers, the median of fixed.acidity decreases slightly from 7.9 to 7.8 and the mean from 8.32 to 8.09. The distribution remains positively skewed.
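A percentile-based trim of this kind can be sketched as below. The 95th-percentile cutoff is an assumption based on the `ss95` name and the "5% outliers" wording elsewhere in this report; the toy data frame and its log-normal values are made up purely for illustration.

```r
set.seed(42)

# Toy right-skewed variable standing in for fixed.acidity (hypothetical values).
toy <- data.frame(fixed.acidity = round(rlnorm(1000, meanlog = 2.1, sdlog = 0.2), 1))

# Keep only rows at or below the 95th percentile of the variable.
cutoff   <- quantile(toy$fixed.acidity, probs = 0.95)
ss95_toy <- subset(toy, fixed.acidity <= cutoff)

summary(toy$fixed.acidity)
summary(ss95_toy$fixed.acidity)
```

Trimming only the upper tail pulls the mean down more than the median, which is the pattern seen in the two summaries above.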
Univariate - 02- citric.acid
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.090 0.260 0.271 0.420 1.000
## ss95$rating: bad
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0200 0.0700 0.1521 0.2600 0.5400
## --------------------------------------------------------
## ss95$rating: average
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0900 0.2400 0.2415 0.3800 0.6000
## --------------------------------------------------------
## ss95$rating: good
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.2800 0.3900 0.3379 0.4600 0.5900
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0900 0.2400 0.2501 0.4000 0.6000
##
## Pearson's product-moment correlation
##
## data: citric.acid and as.numeric(rating)
## t = 8.5825, df = 1519, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.1665936 0.2624811
## sample estimates:
## cor
## 0.2150556
Apart from some outliers, citric acid looks fairly evenly distributed, with very little skewness. The higher range (0.8 - 1.0) has little or no data. Maybe there was some error in the data, or the data is incomplete?
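The per-rating summaries and the Pearson test shown for each variable follow one pattern, sketched below on made-up data. The data frame `toy` and its values are hypothetical; the real analysis runs on the trimmed subset `ss95`.

```r
set.seed(1)
n <- 300

# Hypothetical ordered rating with the same three levels as in the report.
rating <- factor(
  sample(c("bad", "average", "good"), n, replace = TRUE, prob = c(0.05, 0.80, 0.15)),
  levels = c("bad", "average", "good"),
  ordered = TRUE
)

# Hypothetical citric acid values that rise slightly with rating, clamped to [0, 1].
citric.acid <- pmin(pmax(0.15 + 0.08 * as.numeric(rating) + rnorm(n, sd = 0.15), 0), 1)
toy <- data.frame(citric.acid, rating)

# Six-number summary of the variable within each rating group.
by(toy$citric.acid, toy$rating, summary)

# Pearson correlation between the variable and the rating's level codes (1, 2, 3).
ct <- with(toy, cor.test(citric.acid, as.numeric(rating)))
ct$estimate
```

Using the level codes treats the three ratings as equally spaced, which is a simplification worth keeping in mind when reading the coefficients.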
Univariate - 03- volatile.acidity
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1200 0.3900 0.5200 0.5278 0.6400 1.5800
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1200 0.3900 0.5100 0.5059 0.6200 0.8400
## ss95$rating: bad
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2300 0.4900 0.6100 0.5907 0.6800 0.8400
## --------------------------------------------------------
## ss95$rating: average
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1600 0.4100 0.5300 0.5208 0.6300 0.8400
## --------------------------------------------------------
## ss95$rating: good
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1200 0.3000 0.3700 0.4011 0.4800 0.8400
##
## Pearson's product-moment correlation
##
## data: volatile.acidity and as.numeric(rating)
## t = -11.735, df = 1521, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.3335343 -0.2413957
## sample estimates:
## cor
## -0.2881317
The distribution of volatile acidity looks bimodal, with two peaks around 0.4 and 0.6. The overall mean is 0.53, or 0.51 after removing outliers.
Univariate - 04- residual.sugar
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.900 1.900 2.200 2.539 2.600 15.500
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.900 1.900 2.200 2.283 2.500 5.100
## ss95$rating: bad
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.20 1.85 2.10 2.35 2.50 4.50
## --------------------------------------------------------
## ss95$rating: average
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.900 1.900 2.200 2.272 2.500 5.100
## --------------------------------------------------------
## ss95$rating: good
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.200 1.900 2.200 2.338 2.500 5.000
##
## Pearson's product-moment correlation
##
## data: residual.sugar and as.numeric(rating)
## t = 0.68459, df = 1518, p-value = 0.4937
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.03273993 0.06778764
## sample estimates:
## cor
## 0.01756825
The distribution of residual sugar is again positively skewed, with a high peak at around 2.3 and many outliers at the higher end.
Univariate - 05- chlorides
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01200 0.07000 0.07900 0.08747 0.09000 0.61100
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01200 0.07000 0.07800 0.07914 0.08800 0.12600
## ss95$rating: bad
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.04500 0.06700 0.07850 0.07773 0.08725 0.12300
## --------------------------------------------------------
## ss95$rating: average
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.03400 0.07100 0.07900 0.08019 0.08800 0.12600
## --------------------------------------------------------
## ss95$rating: good
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01200 0.06200 0.07200 0.07337 0.08500 0.12400
##
## Pearson's product-moment correlation
##
## data: chlorides and as.numeric(rating)
## t = -4.5271, df = 1517, p-value = 6.447e-06
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.16479428 -0.06554018
## sample estimates:
## cor
## -0.1154554
The distribution of chlorides is also positively skewed, with a high peak at around 0.08 and many outliers at the higher end. The boxplot against rating also shows that most wines across all ratings fall between 0.07 and 0.08.
Univariate - 06- free.sulfur.dioxide
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 7.00 14.00 15.87 21.00 72.00
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 7.00 13.00 14.46 20.00 35.00
## ss95$rating: bad
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.0 5.0 9.0 11.2 15.0 34.0
## --------------------------------------------------------
## ss95$rating: average
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 8.00 14.00 14.97 21.00 35.00
## --------------------------------------------------------
## ss95$rating: good
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.00 6.00 10.00 12.37 16.00 35.00
##
## Pearson's product-moment correlation
##
## data: free.sulfur.dioxide and as.numeric(rating)
## t = -1.735, df = 1520, p-value = 0.08294
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.094493843 0.005800448
## sample estimates:
## cor
## -0.04445872
For free sulphur dioxide, there is a high peak at 7, but then it again follows the same positively skewed, long-tailed pattern with some outliers in the high range. The median drops from 14 to 13 when the 5% outliers are removed.
Univariate - 07- total.sulfur.dioxide
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 6.00 22.00 38.00 46.47 62.00 289.00
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 6.00 21.00 36.00 41.76 57.00 112.00
## ss95$rating: bad
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 7.00 13.00 24.00 31.77 47.00 86.00
## --------------------------------------------------------
## ss95$rating: average
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 6.00 23.00 38.00 43.84 60.00 112.00
## --------------------------------------------------------
## ss95$rating: good
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 7.00 17.00 26.00 32.58 43.00 106.00
##
## Pearson's product-moment correlation
##
## data: total.sulfur.dioxide and as.numeric(rating)
## t = -3.3084, df = 1517, p-value = 0.0009602
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.13436194 -0.03448909
## sample estimates:
## cor
## -0.08463809
Total sulphur dioxide follows the same pattern as free sulphur dioxide. There is a high peak at 7, but then it again follows the same positively skewed, long-tailed pattern with some outliers in the high range.
There seem to be big outliers in this variable. There are almost no observations between 160 and 270, and the values of total sulphur dioxide above 170 do not look like valid data. SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm it becomes evident in the nose and taste of the wine.
Univariate - 08- density
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9901 0.9956 0.9968 0.9967 0.9978 1.0037
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9901 0.9955 0.9966 0.9965 0.9976 1.0000
## ss95$rating: bad
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9934 0.9956 0.9965 0.9965 0.9975 0.9996
## --------------------------------------------------------
## ss95$rating: average
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9901 0.9957 0.9968 0.9967 0.9978 1.0000
## --------------------------------------------------------
## ss95$rating: good
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9906 0.9947 0.9956 0.9958 0.9973 1.0000
##
## Pearson's product-moment correlation
##
## data: density and as.numeric(rating)
## t = -5.6915, df = 1526, p-value = 1.507e-08
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.19292766 -0.09471243
## sample estimates:
## cor
## -0.1441751
Density is fairly normally distributed, with mean, median and mode all close to 0.9967.
Univariate - 09- pH
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.740 3.210 3.310 3.311 3.400 4.010
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.740 3.200 3.300 3.294 3.390 3.570
## ss95$rating: bad
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.740 3.292 3.345 3.339 3.410 3.550
## --------------------------------------------------------
## ss95$rating: average
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.860 3.200 3.300 3.296 3.390 3.570
## --------------------------------------------------------
## ss95$rating: good
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.880 3.190 3.270 3.272 3.360 3.570
##
## Pearson's product-moment correlation
##
## data: pH and as.numeric(rating)
## t = -3.2928, df = 1524, p-value = 0.001015
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.13366496 -0.03401156
## sample estimates:
## cor
## -0.08404841
The pH is fairly normally distributed, with mean, median and mode all close to 3.31.
Univariate - 10- sulphates
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.3300 0.5500 0.6200 0.6581 0.7300 2.0000
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.3300 0.5500 0.6100 0.6321 0.7100 0.9300
## ss95$rating: bad
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.3300 0.4900 0.5550 0.5518 0.5925 0.8600
## --------------------------------------------------------
## ss95$rating: average
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.3700 0.5400 0.6000 0.6209 0.6800 0.9300
## --------------------------------------------------------
## ss95$rating: good
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.3900 0.6500 0.7300 0.7242 0.8100 0.9200
##
## Pearson's product-moment correlation
##
## data: sulphates and as.numeric(rating)
## t = 13.694, df = 1518, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.2860754 0.3756025
## sample estimates:
## cor
## 0.3315852
Sulphates are positively skewed.
Univariate - 11- alcohol
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.40 9.50 10.20 10.42 11.10 14.90
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.4 9.5 10.1 10.3 11.0 12.5
## ss95$rating: bad
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.40 9.60 10.00 10.17 10.97 12.00
## --------------------------------------------------------
## ss95$rating: average
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.40 9.50 10.00 10.17 10.80 12.50
## --------------------------------------------------------
## ss95$rating: good
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.20 10.72 11.30 11.26 11.90 12.50
##
## Pearson's product-moment correlation
##
## data: alcohol and as.numeric(rating)
## t = 13.975, df = 1527, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.2915398 0.3804575
## sample estimates:
## cor
## 0.3367491
Alcohol is positively skewed. It’s interesting to see that good wines have a considerably higher alcohol concentration than bad or average wines.
Eleven of these attributes are physiochemical properties of the wine. They may or may not contribute to the quality, which is scored from 0 to 10.
The Red Wine Dataset originally had 1599 rows and 13 columns, including the index column X. After I removed X and added a new column called ‘rating’, the number of columns is again 13. Here the categorical variables are ‘quality’ and ‘rating’, and the rest are numerical variables which reflect the physicochemical properties of the wine.
Most variables have long-tailed distributions. A few, like pH (a measure of acidity) and density, are well-behaved, approximately normal distributions.
There are some sweeter wines, with about 80 observations having more than 5 g/L of residual sugar. It’s good to bear in mind that wines are only considered sweet at about 45 g/L, and the highest value we observed is 15.5. So there really isn’t any “sweet” wine in our dataset.
I also see that most of the wines in this dataset are of ‘average’ quality, with very few ‘bad’ and ‘good’ ones. This again makes me doubt whether the dataset is complete. Given this imbalance, it might be challenging to build a predictive model, as I don’t have enough data for the good quality and the bad quality wines.
I’m most interested in the quality variable and how the others affect it. The quality scale runs from 0 to 10, but we only have observations with a minimum of 3 and a maximum of 8. The average quality is about 5.64.
Before analyzing the data, my guess is that alcohol and acidity (fixed, volatile or citric) will change the quality of the wine. pH, being related to acidity, may also have some effect on the quality. It would also be interesting to see how the pH is affected by the different acids present in the wine, and whether the overall pH affects the quality. I also think residual sugar will have an effect on quality, as sugar determines how sweet the wine will be and may adversely affect its taste.
A new factor variable ‘rating’ (bad, average, good) was created to group observations effectively, so that we have a meaningful sample for each category.
###Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?
Variable X was removed from the dataset, as it was just an index which has no meaning for data analysis. Citric acid has a unique distribution as compared to the other numeric variables. It almost has a rectangular shape apart from a few outliers. Now if we compare the rating distribution, this distribution of Citric Acid is very unexpected and maybe there is even a possibility of incomplete data collection.
1. Density and pH seem normally distributed, with few outliers.
2. Residual sugar and chlorides seem to have extreme outliers.
3. Fixed and volatile acidity, total and free sulfur dioxide, alcohol and sulphates seem to be long-tailed because of the outliers present.
4. Citric acid has a large number of zero values. I wonder if this is due to incomplete data entry.
Let’s start by computing the correlations between the different variables. Reference: https://www.statmethods.net/stats/correlations.html
##
## ---------------------------------------------------------------------------
## fixed.acidity volatile.acidity citric.acid
## -------------------------- --------------- ------------------ -------------
## **fixed.acidity** 1 -0.2561 **0.6717**
##
## **volatile.acidity** -0.2561 1 **-0.5525**
##
## **citric.acid** **0.6717** **-0.5525** 1
##
## **residual.sugar** 0.1148 0.001918 0.1436
##
## **chlorides** 0.09371 0.0613 0.2038
##
## **free.sulfur.dioxide** -0.1538 -0.0105 -0.06098
##
## **total.sulfur.dioxide** -0.1132 0.07647 0.03553
##
## **density** **0.668** 0.02203 **0.3649**
##
## **pH** **-0.683** 0.2349 **-0.5419**
##
## **sulphates** 0.183 -0.261 **0.3128**
##
## **alcohol** -0.06167 -0.2023 0.1099
##
## **quality** 0.1241 **-0.3906** 0.2264
## ---------------------------------------------------------------------------
##
## Table: Table continues below
##
##
## ------------------------------------------------------------------------------
## residual.sugar chlorides free.sulfur.dioxide
## -------------------------- ---------------- ------------ ---------------------
## **fixed.acidity** 0.1148 0.09371 -0.1538
##
## **volatile.acidity** 0.001918 0.0613 -0.0105
##
## **citric.acid** 0.1436 0.2038 -0.06098
##
## **residual.sugar** 1 0.05561 0.187
##
## **chlorides** 0.05561 1 0.005562
##
## **free.sulfur.dioxide** 0.187 0.005562 1
##
## **total.sulfur.dioxide** 0.203 0.0474 **0.6677**
##
## **density** **0.3553** 0.2006 -0.02195
##
## **pH** -0.08565 -0.265 0.07038
##
## **sulphates** 0.005527 **0.3713** 0.05166
##
## **alcohol** 0.04208 -0.2211 -0.06941
##
## **quality** 0.01373 -0.1289 -0.05066
## ------------------------------------------------------------------------------
##
## Table: Table continues below
##
##
## -----------------------------------------------------------------------------
## total.sulfur.dioxide density pH
## -------------------------- ---------------------- ------------- -------------
## **fixed.acidity** -0.1132 **0.668** **-0.683**
##
## **volatile.acidity** 0.07647 0.02203 0.2349
##
## **citric.acid** 0.03553 **0.3649** **-0.5419**
##
## **residual.sugar** 0.203 **0.3553** -0.08565
##
## **chlorides** 0.0474 0.2006 -0.265
##
## **free.sulfur.dioxide** **0.6677** -0.02195 0.07038
##
## **total.sulfur.dioxide** 1 0.07127 -0.06649
##
## **density** 0.07127 1 **-0.3417**
##
## **pH** -0.06649 **-0.3417** 1
##
## **sulphates** 0.04295 0.1485 -0.1966
##
## **alcohol** -0.2057 **-0.4962** 0.2056
##
## **quality** -0.1851 -0.1749 -0.05773
## -----------------------------------------------------------------------------
##
## Table: Table continues below
##
##
## -------------------------------------------------------------------
## sulphates alcohol quality
## -------------------------- ------------ ------------- -------------
## **fixed.acidity** 0.183 -0.06167 0.1241
##
## **volatile.acidity** -0.261 -0.2023 **-0.3906**
##
## **citric.acid** **0.3128** 0.1099 0.2264
##
## **residual.sugar** 0.005527 0.04208 0.01373
##
## **chlorides** **0.3713** -0.2211 -0.1289
##
## **free.sulfur.dioxide** 0.05166 -0.06941 -0.05066
##
## **total.sulfur.dioxide** 0.04295 -0.2057 -0.1851
##
## **density** 0.1485 **-0.4962** -0.1749
##
## **pH** -0.1966 0.2056 -0.05773
##
## **sulphates** 1 0.09359 0.2514
##
## **alcohol** 0.09359 1 **0.4762**
##
## **quality** 0.2514 **0.4762** 1
## -------------------------------------------------------------------
Strong correlations between two variables are highlighted using **. It is good to see a negative correlation between volatile.acidity and quality (-0.3906), as volatile acidity is the amount of acetic acid in wine, which at too high a level can lead to an unpleasant, vinegar taste. The positive correlation between quality and alcohol (0.4762) seems correct. The strong correlations of citric.acid with fixed.acidity (0.6717) and volatile.acidity (-0.5525) are expected.
However, the positive correlation between volatile.acidity and pH (0.2349) seems weird, since an increase in acidity decreases the pH. Is it possible that Simpson’s Paradox is at play here, in which a trend appears in several different groups of data but disappears or reverses when these groups are combined? https://en.wikipedia.org/wiki/Simpson%27s_paradox
Density has a strong correlation with fixed acidity (0.668). The variables most strongly correlated with quality are volatile acidity (-0.3906) and alcohol (0.4762).
Alcohol has a negative correlation with density (-0.4962). This is consistent with the fact that the density of water is greater than the density of alcohol.
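A correlation matrix like the one above, and the list of its stronger pairs, can be sketched in base R as follows. The toy data frame and its coefficients are hypothetical, and the 0.3 threshold is only an inference from which cells are bolded in the table (the original highlighting rule is not shown).

```r
set.seed(11)
n <- 200

# Hypothetical stand-in for a few numeric columns of the wine data.
alcohol <- runif(n, 8.4, 14.9)
toy <- data.frame(
  alcohol = alcohol,
  density = 1.005 - 0.0008 * alcohol + rnorm(n, sd = 0.001),  # falls as alcohol rises
  pH      = rnorm(n, mean = 3.31, sd = 0.15)                  # unrelated noise
)

# Pairwise Pearson correlations between all numeric columns.
cor_matrix <- cor(toy, method = "pearson")

# List each pair once (upper triangle) whose |r| reaches the assumed threshold.
strong <- which(abs(cor_matrix) >= 0.3 & upper.tri(cor_matrix), arr.ind = TRUE)
strong_pairs <- data.frame(
  var1 = rownames(cor_matrix)[strong[, "row"]],
  var2 = colnames(cor_matrix)[strong[, "col"]],
  r    = round(cor_matrix[strong], 4)
)

round(cor_matrix, 4)
strong_pairs
```

Restricting the search to the upper triangle avoids reporting each pair twice and skips the diagonal of ones.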
Now let us create some Box plots between these variables to see if I have missed anything from the correlation table.
quality vs fixed.acidity
As we can see, fixed acidity has almost no effect on quality. The mean and median values of fixed acidity remain almost unchanged as quality increases.
quality vs volatile.acidity
Volatile acid seems to have a negative impact on the quality of the wine. As volatile acid level goes up, the quality of the wine degrades.
quality vs citric.acid
Citric acid seems to have a positive correlation with Wine Quality. Better wines have higher Citric Acid.
quality vs residual.sugar
Previously I thought that residual sugar might have an effect on wine quality. But this plot contradicts that assumption and shows that residual sugar has almost no effect on the quality of the wine. The mean value of residual sugar is almost the same for every quality level.
quality vs chlorides
Even though the correlation is weak, the median chloride level decreases as quality increases, so lower chloride levels seem to go with better wines.
quality vs free.sulfur.dioxide
Now this is an interesting observation. Low concentrations of free sulphur dioxide are found in both poor and good wines, while higher concentrations are associated with average wines.
quality vs total.sulfur.dioxide
As free sulphur dioxide is a component of total sulphur dioxide, we see a similar pattern here.
quality vs density
Better wines seem to have lower densities. However, it is possible that the low density is due to higher alcohol content, which is actually the driving factor for better wines.
quality vs pH
Better wines seem to have a lower pH, i.e. they are more acidic. But there are quite a few outliers here. So the next logical thing would be to see how the individual acids affect the pH.
fixed.acidity vs pH
This seems logical, as pH decreases as fixed.acidity increases.
volatile.acidity vs pH
Volatile acidity is where Simpson’s paradox may be showing up: pH increases as volatile.acidity increases.
citric.acid vs pH
These three plots make us come back to our old question. Recall that we saw for volatile.acidity, pH has a positive correlation. But we know acidity has a negative correlation with pH. So is it possible, that we are seeing a Simpson’s Paradox at play here? Let’s investigate.
Simpsons Check
So it is indeed Simpson’s paradox that was responsible for the trend reversal of volatile acidity vs pH. I clustered the data into 3 segments and calculated the regression coefficient for each. There is indeed a sign reversal, which points to a lurking variable that changes the overall coefficient.
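The sign reversal can be illustrated on a small constructed example. This is only a sketch of the idea: the segment variable, the numbers, and the way the segments are built are all hypothetical and chosen so that the effect is exact, not taken from the wine data.

```r
# Within each segment pH falls as volatile acidity rises, but segments with
# higher volatile acidity also sit at a higher pH level overall.
offsets <- seq(-0.4, 0.4, length.out = 9)

toy <- do.call(rbind, lapply(1:3, function(g) {
  data.frame(
    segment          = g,
    volatile.acidity = 0.2 * g + 0.1 * offsets,
    pH               = 3.0 + 0.1 * g - 0.05 * offsets
  )
}))

# Slope of pH on volatile acidity for the pooled data.
overall_slope <- coef(lm(pH ~ volatile.acidity, data = toy))[["volatile.acidity"]]

# Slope of the same regression fitted separately inside each segment.
segment_slopes <- sapply(split(toy, toy$segment), function(d) {
  coef(lm(pH ~ volatile.acidity, data = d))[["volatile.acidity"]]
})

overall_slope   # positive
segment_slopes  # negative in every segment
```

The pooled slope is positive because the between-segment shift dominates, while every within-segment slope is -0.5 by construction.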
quality vs sulphates
Even though we see many outliers in the ‘Average’ quality wine, it seems that better wines have a stronger concentration of Sulphates.
quality vs alcohol
The correlation is really distinct here. It is pretty evident that better wines have a higher alcohol content. But we see a great number of outliers here, so alcohol alone may not be what makes a wine a good one. Let’s fit a simple linear model and look at the statistics.
##
## Call:
## lm(formula = as.numeric(quality) ~ alcohol, data = rw)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.8442 -0.4112 -0.1690 0.5166 2.5888
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.12503 0.17471 -0.716 0.474
## alcohol 0.36084 0.01668 21.639 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7104 on 1597 degrees of freedom
## Multiple R-squared: 0.2267, Adjusted R-squared: 0.2263
## F-statistic: 468.3 on 1 and 1597 DF, p-value: < 2.2e-16
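Based on the value of R squared, alcohol alone explains only about 23% of the variance in wine quality. Note that the model is fitted on as.numeric(quality), i.e. on the factor level codes, so the intercept is on the 1-6 code scale rather than the 3-8 score scale; the slope and R squared are unaffected. So there must be other variables at play here, and I have to figure them out in order to build a better regression model. Next I will run a correlation test of each variable against the quality of the wine.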
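The effect of fitting on factor codes instead of scores can be checked on made-up data. In this sketch the simulated alcohol and quality values are hypothetical; the point is only that shifting the response by a constant moves the intercept and leaves the slope and R squared alone, and that for a single predictor R squared equals the squared Pearson correlation (0.4762^2 is about 0.2267 in the output above).

```r
set.seed(7)
n <- 200

# Hypothetical alcohol values and quality scores clamped to the observed 3..8 range.
alcohol <- round(runif(n, 8.4, 14.9), 1)
score   <- pmin(pmax(round(2 + 0.36 * alcohol + rnorm(n, sd = 0.7)), 3), 8)

# Ordered factor with explicit levels, so the codes are always score - 2.
quality <- factor(score, levels = 3:8, ordered = TRUE)

fit_codes  <- lm(as.numeric(quality) ~ alcohol)                # codes 1..6
fit_scores <- lm(as.numeric(as.character(quality)) ~ alcohol)  # scores 3..8

r2_codes  <- summary(fit_codes)$r.squared
r2_scores <- summary(fit_scores)$r.squared

rbind(codes = coef(fit_codes), scores = coef(fit_scores))
c(r2_codes = r2_codes, r2_scores = r2_scores, cor_squared = cor(alcohol, score)^2)
```

Converting through `as.character()` keeps predictions on the familiar 3-8 scale, which is easier to interpret than the shifted codes.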
Final correlation between variable and quality of wine
## fixed.acidity volatile.acidity citric.acid
## 0.12405165 -0.39055778 0.22637251
## log10.residual.sugar log10.chlordies free.sulfur.dioxide
## 0.02353331 -0.17613996 -0.05065606
## total.sulfur.dioxide density pH
## -0.18510029 -0.17491923 -0.05773139
## log10.sulphates alcohol
## 0.30864193 0.47616632
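A named vector of correlations like the one above, including log10-transformed versions of the skewed variables, could be produced along these lines. This is a sketch under assumptions: the toy data, the coefficients used to simulate quality, and the choice of which columns to transform are all hypothetical.

```r
set.seed(3)
n <- 150

# Hypothetical stand-in for a few columns of the wine data.
toy <- data.frame(
  alcohol        = runif(n, 8.4, 14.9),
  sulphates      = rlnorm(n, meanlog = log(0.62), sdlog = 0.25),
  residual.sugar = rlnorm(n, meanlog = log(2.2), sdlog = 0.40)
)
toy$quality <- pmin(pmax(round(
  1 + 0.36 * toy$alcohol + 1.5 * log10(toy$sulphates) + rnorm(n, sd = 0.7)
), 3), 8)

# Long-tailed variables are log10-transformed before correlating.
predictors <- with(toy, data.frame(
  alcohol,
  log10.sulphates      = log10(sulphates),
  log10.residual.sugar = log10(residual.sugar)
))

# Pearson correlation of each predictor with the numeric quality score.
cors <- sapply(predictors, function(v) cor(v, toy$quality))
sort(cors, decreasing = TRUE)
```

The log transform pulls in the long right tail, so a few extreme values have less influence on the coefficient.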
investigation. How did the feature(s) of interest vary with other features in
the dataset?
Observations
1. Fixed acidity seems to have almost no effect on quality.
2. Volatile acidity seems to have a negative correlation with quality.
3. Better wines seem to have a higher concentration of citric acid.
4. Better wines seem to have higher alcohol percentages. But when I fitted a linear model, the R squared value showed that alcohol by itself explains only about 23% of the variance in quality, so there may be other factors at play.
5. Even though the correlation is weak, lower chloride levels seem to go with better quality wines.
6. Better wines seem to have lower densities. But then again, this may be due to their higher alcohol content.
7. Better wines seem to be more acidic.
8. Residual sugar has almost no effect on wine quality.
(not the main feature(s) of interest)? Volatile acidity had a positive correlation with pH, which was unexpected. Later we found that this was due to Simpson’s Paradox.
From the correlation test, it seems that the following variables have the highest correlation with wine quality:
1. Alcohol
2. Sulphates (log10)
3. Volatile acidity
4. Citric acid
As we analysed, that alcohol plays a strong part in the quality of the wine even though it actually contributes only 22% of the total quality, now I will first make alcohol constant and try to insert a few more variables to see if they contribute to the overall quality in any other way.
With constant Alcohol, density does not seem to play a prominet role in changing the quality of the alcohol. So our previous suspicion must be true that the correlation we were seeing of density with quality was due to alcohol percent.
It looks like wines with higher alcohol content are rated better when they also have a higher level of sulphates.
Volatile acidity has just the opposite effect: a lower concentration of volatile acid together with a higher concentration of alcohol seems to produce better wines.
Here also, low pH and high alcohol percentage seem to produce better wines.
There is no such correlation between residual sugar and quality.
In general, lower sulphur dioxide seems to produce better wine, even though there are some outliers among the better wines with high sulphur dioxide.
Now let us try to investigate the effect of Acids on the Quality of Wines.
Higher citric acid and lower volatile acid seem to produce better wines.
I don’t see much correlation here.
Again, I don’t get much correlation with the quality here.
Were there features that strengthened each other in terms of looking at your feature(s) of interest?
1. High alcohol and sulphate content together seem to produce better wines.
2. Citric acid, even though weakly correlated, plays a part in improving the wine quality.
I saw that alcohol and sulphates played a major role in determining wine quality. In the final linear model I made, I also plotted the error against quality, which shows how the prediction error varies across the different quality levels. I think these three plots are the most critical ones for this project, so I decided to include them in the Final Plots and Summary section.
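The error-versus-quality idea can be sketched as follows. The formula here is an assumption for illustration, since the final model may have used different terms. The helper name `mean_error_by_quality` is hypothetical and the data are simulated, not the wine data.

```r
# Hypothetical helper: fit a linear model and average its errors per quality level.
# The formula is an assumption; the model used in the analysis may differ.
mean_error_by_quality <- function(df) {
  fit <- lm(quality ~ alcohol + log10(sulphates), data = df)
  err <- df$quality - predict(fit, newdata = df)  # actual minus predicted
  tapply(err, df$quality, mean)
}

set.seed(1)  # simulated data for illustration, NOT the wine data
n <- 300
sim <- data.frame(
  alcohol   = runif(n, 8, 14),
  sulphates = runif(n, 0.3, 1.5)
)
raw_score <- 0.5 * sim$alcohol + 2 * log10(sim$sulphates) + rnorm(n, sd = 0.7)
sim$quality <- round(pmin(8, pmax(3, raw_score)))  # clamp to the 3-8 range seen in the data

res <- mean_error_by_quality(sim)
print(res)
```

A mean error far from zero at one quality level would mean the model systematically over- or under-predicts wines of that quality.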
## $title
## [1] "Influence of alcohol on wine quality"
##
## $subtitle
## NULL
##
## attr(,"class")
## [1] "labels"
This plot tells us that alcohol percentage plays a big role in determining the quality of wines: the higher the alcohol percentage, the better the wine quality. Even though most of the data pertains to average quality wine, we can see from the above plot that the mean and median nearly coincide in every box, which suggests that alcohol is roughly symmetrically distributed within each quality level. So a very high median in the best quality wines implies that most of those wines have a high percentage of alcohol. But previously, from our linear model, we saw from the R squared value that alcohol alone explains only about 22% of the variance in wine quality. So alcohol is not the only factor responsible for the improvement in wine quality.
In this plot, we see that the best quality wines have high values for both alcohol percentage and sulphate concentration, implying that high alcohol content and high sulphate concentration together seem to produce better wines. There is a very slight downward slope, maybe because in the best quality wines the percentage of alcohol is slightly greater than the concentration of sulphates.
This is perhaps the most meaningful graph. I subset the data to remove the ‘average’ wines, that is, any wine with a rating of 5 or 6. As the correlation tests show, wine quality was affected most strongly by alcohol together with volatile acidity, and by alcohol together with sulphates.
While the boundaries are not as clear cut or modal, it’s apparent that high volatile acidity, with few exceptions, kept wine quality down. A combination of high alcohol content and low volatile acidity produced better wines. At the same time, a combination of high alcohol content and high sulphates produced better wines.
Through this exploratory data analysis, I identified the key factors that determine and drive wine quality, mainly: alcohol content, sulphates, and acidity. It is important to note, however, that wine quality is ultimately a subjective measure, albeit measured by wine experts. That said, the correlations for these variables are within reasonable bounds. The graphs adequately illustrate the factors that make good wines ‘good’ and bad wines ‘bad’.
In this data, my main struggle was to reach a higher confidence level when identifying the factors responsible for differences in quality, since the data was heavily concentrated around the ‘Average’ quality.
Citric acid has lots of zero values in the dataset, which made sense once I found out that citric acid is actually added to some wines to increase the acidity. So it’s to be expected that some wines would not have citric acid at all.
The other variables showed either a Positively skewed or a Normal Distribution.
First I plotted different variables against quality to see the bivariate relationships between them, and then one by one I threw in one or more additional factors to see if together they have any effect on the categorical variable. I saw that the factors which affected the quality of the wine the most were alcohol percentage, sulphate concentration, and acid concentrations.
I tried to figure out the effect of each individual acid on the overall pH of the wine. Here I found a very peculiar phenomenon: for volatile acidity, the pH was increasing with acidity, which was strange.
In the final part of my analysis, I plotted multivariate plots to see if there were some interesting combinations of variables which together affected the overall quality of the wine. It was in this section I found out that density did not play a part in improving wine quality.
Although our dataset is rather large, 1599 wines, it could have been useful to have more wines, or at least, wines with a more equal distribution across the values of quality. This could have made it easier to distinguish what makes a good wine, versus a bad wine. Having just 1.1% of wines with a quality value of 8 made it somewhat difficult to determine what makes them different from the other wines.
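The imbalance is easy to quantify from the quality counts printed near the top of this report. The sketch below just turns those counts into percentages; the variable names are my own.

```r
# Quality counts as printed by table() earlier in this report.
counts <- c(`3` = 10, `4` = 53, `5` = 681, `6` = 638, `7` = 199, `8` = 18)

# Share of each quality level, in percent.
shares <- round(100 * counts / sum(counts), 1)
print(shares)

# Share of 'average' wines (quality 5 or 6), in percent.
average_share <- 100 * sum(counts[c("5", "6")]) / sum(counts)
print(round(average_share, 1))
```

Wines rated 5 or 6 make up the large majority of the sample, which is why the extremes are hard to characterise.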
For future analysis of wine quality, there are a few things that I would like to incorporate. I would like to investigate the impact of weather and season on the quality of wine. I would also like to understand whether good wines are always expensive. For the given dataset, I think the quality or type of grapes would be an interesting feature to add.